This repository was archived by the owner on Jun 3, 2026. It is now read-only.

feat(hackathon): add benchmarking strandly script #1087

Draft
lizradway wants to merge 2 commits into strands-agents:main from lizradway:benchmark

Conversation

@lizradway lizradway commented May 19, 2026

Member

Description

Adds a strandly benchmark command that runs Strands agents against ContextBench — a code investigation benchmark that measures how well an agent finds relevant code for real GitHub issues.

The benchmark:

  • Loads tasks from ContextBench's gold-annotated dataset (parquet files with file/symbol/span annotations)
  • Runs a Strands agent (Bedrock Claude by default) with a bash tool to investigate the target repo
  • Evaluates the agent's trajectory against gold annotations via ContextBench's Python evaluation
  • Reports file/symbol/span coverage and precision metrics
  • Optionally emits metrics to CloudWatch for trending

Includes 6 built-in configs testing different context management strategies (control, offloader, offloader-aggressive, summarizing, sliding-proactive, offloader-summarizing); these will be updated to use the built-in context management strategies. Also supports custom agent files via --agent-file, a configurable model via --model, and a --min-coverage flag for future CI gating.

Usage:

strandly benchmark --suite contextbench
strandly benchmark --suite contextbench --agent-file ./my-agent.ts --cloudwatch

Related Issues

N/A

Documentation PR

N/A — README included at strandly/src/benchmark/README.md

Type of Change

New feature

Testing

How have you tested the change?

  • Ran end-to-end with Bedrock Claude Sonnet 4 on the django__django-15987 task and achieved 100% file coverage
  • Verified CloudWatch metric emission
  • Verified --agent-file custom config loading
  • Verified --min-coverage threshold gating
  • Verified error handling (missing API keys, invalid model IDs, timeouts)
  • TypeScript type-checks clean (npx tsc --noEmit --project strandly/tsconfig.json)
  • I ran npm run check

Checklist

  • I have read the CONTRIBUTING document
  • I have added any necessary tests that prove my fix is effective or my feature works
  • I have updated the documentation accordingly
  • I have added an appropriate example to the documentation to outline the feature, or no new docs are needed
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@lizradway lizradway temporarily deployed to manual-approval May 19, 2026 17:48 — with GitHub Actions Inactive
@github-actions github-actions Bot added the strands-running label May 21, 2026
Comment thread strandly/package.json
Comment thread strandly/src/benchmark/contextbench/loader.ts
Comment thread strandly/src/benchmark/contextbench/loader.ts
Comment thread strandly/src/benchmark/contextbench/trajectory.ts
Comment thread strandly/src/benchmark/contextbench/trajectory.ts
Comment thread strandly/src/benchmark/index.ts
Comment thread strandly/src/cli.ts
Comment thread strandly/src/benchmark/runner.ts
@github-actions

Contributor

Assessment: Comment

This is a well-structured benchmarking tool for the strandly CLI. The code is organized cleanly with good separation of concerns (loader, runner, evaluator, reporter, cloudwatch). The types are well-defined and the README documentation is thorough.

Review Categories
  • Testing: No unit tests are included for any of the pure functions (trajectory extraction, report generation, metric parsing). Even for hackathon/prototype code, the regex-heavy trajectory logic would benefit from test coverage.
  • Code Hygiene: Unused zod dependency, unused ROOT constant, and .cache/ directory not gitignored.
  • Composability: process.exit() calls in library functions make the code harder to test and reuse — throwing errors and handling at the CLI boundary would be more idiomatic.
  • Robustness: Timer leak in Promise.race timeout pattern, and no validation of --min-coverage input.

Nice addition to the developer tooling — the integration with ContextBench's evaluation framework and the CloudWatch metrics emission for trending are particularly useful.
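
The Composability and Robustness points above could be handled together along these lines. This is only a sketch: `BenchmarkConfigError` and `parseMinCoverage` are hypothetical names, and the accepted range of `--min-coverage` is not stated in this PR, so the code assumes a fraction between 0 and 1. The idea is that library code throws and only the CLI entry point decides whether to exit.

```typescript
// Hypothetical error type: thrown by library code instead of calling process.exit().
class BenchmarkConfigError extends Error {
  constructor(message: string) {
    super(message)
    this.name = 'BenchmarkConfigError'
  }
}

// Assumption: --min-coverage is a fraction between 0 and 1 inclusive.
function parseMinCoverage(raw: string): number {
  const value = Number(raw)
  // Number('') is 0, so empty input has to be rejected explicitly.
  if (raw.trim() === '' || !Number.isFinite(value) || value < 0 || value > 1) {
    throw new BenchmarkConfigError(`--min-coverage must be a number in [0, 1], got "${raw}"`)
  }
  return value
}
```

The CLI boundary can then catch the error, print its message, and set a non-zero exit code, which keeps the parsing logic unit-testable.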

@github-actions github-actions Bot removed the strands-running label May 21, 2026
@agent-of-mkmeral

Contributor

Really nice addition — clean separation (loader/runner/evaluator/reporter/cloudwatch), thorough README, and integrating ContextBench is exactly the right call. Since it's a draft devtool I skipped style nits and focused on bugs that would make the benchmark numbers wrong, because that's what makes a benchmark misleading rather than just rough. I checked out the branch locally and tested the trajectory parser empirically.

TL;DR: three issues in the measurement layer systematically bias the cross-config comparison the tool exists to make. None block merging as a prototype, but I'd hold off publishing the numbers as a strategy comparison until #1–#3 are addressed.


🔴 Critical (these change the scores)

1. Trajectory is reconstructed from agent.messages after the run — but conversation managers mutate that array in place.
When the run is long enough to trigger context reduction (exactly what these strategies target), the early toolUse blocks are already gone by extraction time, so the agent's early file reads aren't counted. Importantly, ContextOffloader only rewrites toolResult content and leaves toolUse intact, while SlidingWindow/Summarizing splice messages out — so offloader configs keep their full trajectory and the others don't. That's a structural advantage unrelated to investigation quality.

2. The bash path regex misses most common investigation commands.
grep/rg/ls/find/awk contribute zero files, head -n/tail -n lose the file, multi-file cat keeps only the first, and less +100 file.py injects a garbage path. Coverage ends up reflecting the model's bash phrasing more than what it actually found.

3. Symbol / Span / EditLoc metrics are structurally always 0.000 (spans/symbols hardcoded empty in evaluator.ts) but the report table + README present them as real measurements.

Details, evidence & suggested fixes for #1–#3

#1 — post-run trajectory extraction vs. in-place mutation

// runner.ts — runs AFTER agent.invoke() completes
const trajectory = extractTrajectory(agent.messages, repoDir)
  • SlidingWindowConversationManager → messages.splice(0, trimIndex) (sliding-window-conversation-manager.ts:268) removes oldest messages.
  • SummarizingConversationManager → messages.splice(0, messagesToSummarizeCount, summaryMessage) (summarizing-conversation-manager.ts:173) replaces oldest ~30% with a summary.
  • ContextOffloader → only edits toolResult content via AfterToolCallEvent; toolUse blocks survive (plugin.ts).

Net: coverage is under-counted for the compressing configs, and the under-count grows the more aggressively they compress, so the comparison is apples-to-oranges. Suggested fix: capture tool calls live with a BeforeToolCallEvent/AfterToolCallEvent hook into a side list, instead of reconstructing from the post-run message array. This is the single highest-impact change.
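
A rough sketch of why a side list fixes this. It does not use the real SDK hook API: `ToolCall`, `TrajectoryRecorder`, and `onBeforeToolCall` are hypothetical stand-ins for whatever the before-tool-call hook actually passes, and the `messages` array here only simulates a conversation manager trimming in place.

```typescript
// Hypothetical shape for illustration; the real SDK event types may differ.
interface ToolCall {
  name: string
  input: Record<string, unknown>
}

// Records every tool call at the moment it happens, in a list that
// conversation managers never touch.
class TrajectoryRecorder {
  private readonly calls: ToolCall[] = []

  // Intended to be wired to a before-tool-call hook (assumed to exist).
  onBeforeToolCall(call: ToolCall): void {
    // Shallow-copy the input so later edits to the message do not alter the record.
    this.calls.push({ name: call.name, input: { ...call.input } })
  }

  snapshot(): readonly ToolCall[] {
    return [...this.calls]
  }
}

// Simulated run: three tool calls, then a sliding-window style trim.
const recorder = new TrajectoryRecorder()
const messages: ToolCall[] = []

for (const command of ['cat a.py', 'cat b.py', 'cat c.py']) {
  const call: ToolCall = { name: 'bash', input: { command } }
  recorder.onBeforeToolCall(call)
  messages.push(call)
}

// The manager drops the two oldest messages in place.
messages.splice(0, 2)

const survivingInMessages = messages.length // what post-run extraction would see
const recordedLive = recorder.snapshot().length // what the side list kept
```

After the trim, post-run extraction sees one call while the recorder still holds all three, so every config would be scored on its full trajectory.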

#2 — regex parser (tested locally against extractFilePathsFromToolCall)

| Command | Extracted | Should be |
| --- | --- | --- |
| head -n 50 forms/models.py | [] | models.py |
| tail -n 100 file.py | [] | file.py |
| cat a.py b.py c.py | [a.py] | all three |
| grep -rn 'def clean' forms/ | [] | (search hits) |
| rg 'class ModelForm' | [] | (search hits) |
| less +100 file.py | [+100] ⚠️ | file.py |
| cat ./forms/models.py | ./forms/... | forms/... |

The -n flag is captured as the path then dropped by the startsWith('-') guard, so the file is silently lost. Suggested fix: strip flags (-n, -50, +100) before treating tokens as paths, handle multiple files per command, and capture grep/rg/find targets — or better, have the agent read through a structured tool you can parse reliably rather than reverse-engineering bash.
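
For the flag and multi-file cases, something like the following might be enough as a stopgap. It is a sketch, not a replacement for extractFilePathsFromToolCall: `extractReadPaths`, `READ_COMMANDS`, and `VALUE_FLAGS` are hypothetical names, the tokenizer is a naive whitespace split (quoted paths and quoted shell operators are not handled), and search commands such as grep/rg are deliberately left out because their hits come from the output rather than the command line.

```typescript
// Commands whose positional arguments are files being read.
const READ_COMMANDS = new Set(['cat', 'head', 'tail', 'less'])

// Per-command flags that consume the next token as a value (e.g. `head -n 50`).
// `cat -n` takes no value, so flags cannot be treated the same for every command.
const VALUE_FLAGS: Record<string, readonly string[]> = {
  cat: [],
  head: ['-n', '-c'],
  tail: ['-n', '-c'],
  less: [],
}

function pathsFromSegment(segment: string): string[] {
  const [program = '', ...args] = segment.trim().split(/\s+/)
  if (!READ_COMMANDS.has(program)) return []

  const valueFlags = VALUE_FLAGS[program] ?? []
  const paths: string[] = []
  for (let i = 0; i < args.length; i++) {
    const arg = args[i]
    if (valueFlags.includes(arg)) {
      i++ // skip the flag's value as well
      continue
    }
    // Skip remaining flags (`-50`, `-q`) and `less`-style `+100` offsets.
    if (arg.startsWith('-') || arg.startsWith('+')) continue
    paths.push(arg)
  }
  return paths
}

// Splits on `&&`, `||`, `|`, and `;` so each simple command is parsed separately.
function extractReadPaths(command: string): string[] {
  const paths: string[] = []
  for (const segment of command.split(/&&|\|\||[|;]/)) {
    paths.push(...pathsFromSegment(segment))
  }
  return paths
}
```

This returns paths exactly as typed, so they still need normalizing before comparison with gold paths. A structured read tool would avoid the parsing problem entirely.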

#3 — always-zero metrics

runner.ts calls evaluate(task, fileList) with no spans, and evaluator.ts hardcodes spans: spans ?? {}, symbols: {}, so symbol/span/editloc are always 0. The reporter.ts table and README still list them as real metrics. Suggested fix: either populate spans/symbols (the TrajectoryEntry already has startLine/endLine fields) or drop those rows from the report + README until they're wired up.


🟠 Worth addressing for reproducibility

#4 nondeterminism  ·  #5 path normalization

4. BedrockModel({ modelId, stream: false }) doesn't set temperature, and each config runs once on a single task. Run-to-run sampling variance will look like config differences (especially in CloudWatch trends). Consider temperature: 0, and ideally multiple seeds/tasks with reported variance — otherwise a 5–10% swing is within noise.

5. toRelativePath doesn't strip ./ or resolve relative paths, so ./foo.py or a bare path after cd won't match gold foo.py — counts as both a miss and a false positive. Normalize to repo-relative POSIX paths before comparing.
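
One possible shape for the normalization in #5, as a sketch only. `toRepoRelative` is a hypothetical helper, not the existing toRelativePath; it assumes absolute POSIX `repoDir` and `cwd` values, and that the caller tracks the working directory (for example after a `cd`). It resolves `.` and `..` purely as strings and does not touch the filesystem or follow symlinks.

```typescript
// Normalizes a path the agent used into a repo-relative POSIX path.
// Returns null when the path resolves to somewhere outside the repo.
function toRepoRelative(rawPath: string, repoDir: string, cwd: string = repoDir): string | null {
  // Relative paths are interpreted against the command's working directory.
  const absolute = rawPath.startsWith('/') ? rawPath : `${cwd}/${rawPath}`

  // Resolve `.`, `..`, and duplicate slashes segment by segment.
  const resolved: string[] = []
  for (const segment of absolute.split('/')) {
    if (segment === '' || segment === '.') continue
    if (segment === '..') resolved.pop()
    else resolved.push(segment)
  }
  const normalized = '/' + resolved.join('/')

  // Normalize the repo root the same way (drops any trailing slash).
  const root = '/' + repoDir.split('/').filter((s) => s !== '').join('/')
  if (normalized === root) return ''
  if (!normalized.startsWith(root + '/')) return null
  return normalized.slice(root.length + 1)
}
```

With this, `./foo.py` from the repo root and a bare `models.py` after `cd forms` both come out in the same repo-relative form the gold annotations use.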

🟡 Minor (fine for a devtool)

#6 timer leak  ·  #7 token definition  ·  #8 double checkout
  • 6. runner.ts clears timeoutId on the success path but not in the catch, leaving the 10-min timer pending on errors (also flagged by the bot).
  • 7. CloudWatch TokenUsage = input+output, but the README/reporter frame "Input Tokens" as the cost proxy — pick one definition so trend lines stay consistent.
  • 8. git checkout A || git fetch && git checkout A parses as (checkout || fetch) && checkout, so the happy path checks out twice. Harmless, just wasteful.
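
For #6, a minimal sketch of a timeout wrapper that cannot leak the timer. `withTimeout` is a hypothetical helper rather than the code in runner.ts; the point is only that clearing the timer in `finally` covers the success, error, and timeout paths alike.

```typescript
// Races `work` against a timer and always clears the timer afterwards.
async function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  let timeoutId: ReturnType<typeof setTimeout> | undefined
  const timeout = new Promise<never>((_, reject) => {
    timeoutId = setTimeout(() => reject(new Error(`Timed out after ${ms} ms`)), ms)
  })
  try {
    return await Promise.race([work, timeout])
  } finally {
    // Runs on success, failure, and timeout, so the timer never outlives the call.
    clearTimeout(timeoutId)
  }
}
```

For #8, grouping the fallback explicitly, as in `git checkout A || { git fetch && git checkout A; }`, would avoid the second checkout on the happy path.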

Great foundation overall — fixing #1 (event-hook trajectory capture) and #2 (flag/multi-file/grep handling) would make the cross-config comparison trustworthy, and #3 is a quick "implement or hide" decision. Happy to help if useful!

lizradway pushed a commit to lizradway/sdk-typescript that referenced this pull request Jun 1, 2026
Partial fix to strands-agents#1069 - previously the agent would prematurely exit if the agent generated a tool with an invalid name; this avoids that by ensuring the agent loop continues with zero tool-uses.

---------

Co-authored-by: Mackenzie Zastrow <zastrowm@users.noreply.github.com>